COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval
Lecture 6: Scoring, Term Weighting and the Vector Space Model

1


Recap of lecture 5
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Dictionary compression for Boolean indexes
 Dictionary string, blocks, front coding
 Postings compression: gap encoding, prefix-unique codes
 Variable-Byte, Gamma codes, Golomb/Rice codes

Component                              |       MB
collection (text, xml markup, etc.)    |  3,600.0
collection (text)                      |    960.0
term-document incidence matrix         | 40,000.0
postings, uncompressed (32-bit words)  |    400.0
postings, uncompressed (20 bits)       |    250.0
postings, variable byte encoded        |    116.0
postings, γ-encoded                    |    101.0
2


This lecture; IIR Sections 6.2–6.4.3
 Ranked retrieval
 Scoring documents
 Term frequency
 Collection statistics
 Weighting schemes
 Vector space scoring
3


Ch. 6
Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don't.
 Good for expert users with a precise understanding of their needs and the collection.
 Also good for applications: applications can easily consume 1000s of results.
 Not good for the majority of users.
 Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
 Most users don't want to wade through 1000s of results.
 This is particularly true of web search.
4


Ch. 6
Problem with Boolean search: feast or famine
 Boolean queries often result in either too few (=0) or too many (1000s) results.
 Query 1: "standard user dlink 650" → 200,000 hits
 Query 2: "standard user dlink 650 no card found": 0 hits
 It takes a lot of skill to come up with a query that produces a manageable number of hits.
 AND gives too few; OR gives too many
5


Ranked retrieval models
 Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
 Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
 In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa
6


Ch. 6
Feast or famine: not a problem in ranked retrieval
 When a system produces a ranked result set, large result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k (≈ 10) results
 We don't overwhelm the user
 Premise: the ranking algorithm works
7


Ch. 6
Scoring as the basis of ranked retrieval
 We wish to return, in order, the documents most likely to be useful to the searcher
 How can we rank-order the documents in the collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query "match".
8


Ch. 6
Query-document matching scores
 We need a way of assigning a score to a query/document pair
 Let's start with a one-term query
 If the query term does not occur in the document: the score should be 0
 The more frequent the query term in the document, the higher the score (should be)
 We will look at a number of alternatives for this.
9


Ch. 6
Take 1: Jaccard coefficient
 Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = ∅
 A and B don't have to be the same size.
 Always assigns a number between 0 and 1.
10


Ch. 6
Jaccard coefficient: Scoring example
 What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march

(The term Ides of March is best known as the date on which Julius Caesar was killed, in 709 AUC, i.e., 44 BC.)
11
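A quick check of the exercise above, as a minimal Python sketch (whitespace tokenization and lowercase input are assumptions made for illustration):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token sets."""
    a, b = set(a), set(b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

print(jaccard(query, doc1))  # 1/6 ≈ 0.167 (only "march" is shared; union has 6 terms)
print(jaccard(query, doc2))  # 1/5 = 0.2  -> Document 2 scores higher
```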


Ch. 6
Issues with Jaccard for scoring
1. It doesn't consider term frequency (how many times a term occurs in a document)
2. Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information
 We need a more sophisticated way of normalizing for length
12


Sec. 6.2
Recall (Lecture 1): Binary term-document incidence matrix

[Incidence matrix figure omitted]
Each document is represented by a binary vector ∈ {0,1}^|V|
13


Sec. 6.2
Term-document count matrices
 Consider the number of occurrences of a term in a document:
 Each document is a count vector in ℕ^|V|: a column of the term-document count matrix

[Count matrix figure omitted]
14


Bag of words model
 The vector representation doesn't consider the ordering of words in a document
 John is quicker than Mary and Mary is quicker than John have the same vectors
 This is called the bag of words model.
 In a sense, this is a step back: the positional index was able to distinguish these two documents.
 We will look at "recovering" positional information later in this course.
 For now: bag of words model
15


Term frequency tf
 The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
 We want to use tf when computing query-document match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR
16


Sec. 6.2
Log-frequency weighting
 The log frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0;  0 otherwise

 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: sum over terms t in both q and d:

    score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d})

 The score is 0 if none of the query terms is present in the document.
17
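A minimal sketch of this overlap score in Python (whitespace tokenization is an assumption, not from the slides):

```python
import math
from collections import Counter

def log_tf_score(query, doc):
    """Sum of (1 + log10 tf) over query terms that appear in the doc."""
    tf = Counter(doc.split())
    return sum(1 + math.log10(tf[t]) for t in set(query.split()) if tf[t] > 0)

# tf: march=3, ides=1, of=1 -> (1 + log10 3) + 1 + 1 ≈ 3.477
print(round(log_tf_score("ides of march", "march march ides of march"), 3))
```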


Sec. 6.2.1
Document frequency
 Rare terms are more informative than frequent terms
 Recall stop words
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be relevant to the query arachnocentric
 → We want a high weight for rare terms like arachnocentric.
18


Sec. 6.2.1
Document frequency, continued
 Frequent terms are less informative than rare terms
 Consider a query term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be relevant than a document that doesn't
 But it's not a sure indicator of relevance.
 → For frequent terms, we want high positive weights for words like high, increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this.
19


Sec. 6.2.1
idf weight
 df_t is the document frequency of t: the number of documents that contain t
 df_t is an inverse measure of the informativeness of t
 df_t ≤ N
 We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

 We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

It will turn out that the base of the log is immaterial.
20


Sec. 6.2.1
idf example, suppose N = 1 million

term       | df_t      | idf_t
calpurnia  | 1         | 6
animal     | 100       | 4
sunday     | 1,000     | 3
fly        | 10,000    | 2
under      | 100,000   | 1
the        | 1,000,000 | 0

There is one idf value for each term t in a collection.
21
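The idf column above is just log10(N/df_t); a few lines of Python reproduce it (df values hard-coded from the slide):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    # idf_t = log10(N / df_t): rarer terms get higher weights
    print(f"{term:10s} df={dft:>9,}  idf={math.log10(N / dft):.0f}")
```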


Effect of idf on ranking
 Does idf have an effect on ranking for one-term queries, like
 iPhone
 idf has no effect on ranking one-term queries
 idf affects the ranking of documents for queries with at least two terms
 For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
22


Sec. 6.2.1
Collection vs. Document frequency
 The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
 Example:

Word      | Collection frequency | Document frequency
insurance | 10440                | 3997
try       | 10422                | 8760

 Which word is a better search term (and should get a higher weight)?
23


Sec. 6.2.2
tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight.

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

 Best known weighting scheme in information retrieval
 Note: the "-" in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf × idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection
24


Sec. 6.2.2
Final ranking of documents for a query

    Score(q,d) = Σ_{t ∈ q∩d} tf.idf_{t,d}
25
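Tying the last few slides together: a small sketch that ranks documents by summed tf-idf, with log-tf and log-idf as defined above. The three-document corpus is invented for illustration:

```python
import math
from collections import Counter

docs = ["car insurance auto insurance", "best auto deals", "car repair"]
tfs = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for tf in tfs for t in tf)  # number of docs containing each term

def tf_idf_score(query, tf):
    score = 0.0
    for t in set(query.split()):
        if tf[t] > 0:
            score += (1 + math.log10(tf[t])) * math.log10(N / df[t])
    return score

for d, tf in zip(docs, tfs):
    print(f"{tf_idf_score('best car insurance', tf):.3f}  {d}")
# The first document wins: "insurance" is rare (df=1) and occurs twice.
```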


Sec. 6.3
Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|
26


Sec. 6.3
Documents as vectors
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
 These are very sparse vectors – most entries are zero.
27


Sec. 6.3
Queries as vectors
 Key idea 1: do the same for queries: represent them as vectors in the space
 Key idea 2: rank documents according to their proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents
28


Sec. 6.3
Formalizing vector space proximity
 First cut: distance between two points
 ( = distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of different lengths.
29


Sec. 6.3
Why distance is a bad idea

[Figure omitted] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
30


Sec. 6.3
Use angle instead of distance
 Thought experiment: take a document d and append it to itself. Call this document d′.
 "Semantically" d and d′ have the same content
 The Euclidean distance between the two documents can be quite large
 The angle between the two documents is 0, corresponding to maximal similarity.
 Key idea: rank documents according to their angle with the query.
31


Sec. 6.3
From angles to cosines
 The following two notions are equivalent.
 Rank documents in increasing order of the angle between query and document
 Rank documents in decreasing order of cosine(query, document)
 Cosine is a monotonically decreasing function on the interval [0°, 180°]
32


Sec. 6.3
From angles to cosines
 But how – and why – should we be computing cosines?
33


Sec. 6.3
Length normalization
 A vector can be (length-)normalized by dividing each of its components by its length – for this we use the L2 norm:

    ‖x‖₂ = √(Σ_i x_i²)

 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
 Long and short documents now have comparable weights
34


Sec. 6.3
cosine(query, document)

    cos(q,d) = (q · d) / (‖q‖ ‖d‖) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_i q_i²) · √(Σ_i d_i²) )

The numerator is the dot product of q and d; dividing by the lengths turns q and d into unit vectors.
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
35


Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q,d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.
36
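A minimal illustration of the last two slides in Python: normalize once, and cosine reduces to a plain dot product (dense lists stand in for the sparse vectors; the toy numbers are made up):

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    # For length-normalized vectors, cosine similarity is just the dot product
    return sum(qi * di for qi, di in zip(l2_normalize(q), l2_normalize(d)))

d = [3.0, 1.0, 0.0]
d_doubled = [6.0, 2.0, 0.0]   # "d appended to itself"
print(cosine(d, d_doubled))   # ≈ 1.0: identical after normalization
```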


Cosine similarity illustrated

[Figure omitted: query and document vectors in term space]


Sec. 6.3
Cosine similarity amongst 3 documents

How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts)
term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  | 7   | 11
gossip    | 2   | 0   | 6
wuthering | 0   | 0   | 38

Note: to simplify this example, we don't do idf weighting.
38


Sec. 6.3
3 documents example contd.

Log frequency weighting
term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 | 0    | 1.78
wuthering | 0    | 0    | 2.58

After length normalization
term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 | 0     | 0.405
wuthering | 0     | 0     | 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
39
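These numbers can be reproduced in a few lines of Python (log-tf weighting, L2 normalization, dot products), mirroring the tables above:

```python
import math

counts = {  # term frequencies from the previous slide
    "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_tf(v):
    return [1 + math.log10(x) if x > 0 else 0.0 for x in v]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vecs = {name: normalize(log_tf(v)) for name, v in counts.items()}
for a, b in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    sim = sum(x * y for x, y in zip(vecs[a], vecs[b]))
    print(f"cos({a},{b}) ≈ {sim:.2f}")   # 0.94, 0.79, 0.69
```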


Sec. 6.3
Computing cosine scores

[Algorithm figure omitted: CosineScore(q), IIR Figure 6.14]

40
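The slide's algorithm figure (term-at-a-time scoring with an array of accumulators) did not survive extraction. Below is a Python sketch in its spirit, assuming an in-memory inverted index mapping each term to (docID, tf) postings and precomputed document lengths; the data-structure names are assumptions, and lnc.ltc-style weights are used:

```python
import heapq
import math

def cosine_scores(query_terms, index, doc_lengths, N, k=10):
    """Term-at-a-time cosine scoring, in the spirit of IIR Figure 6.14."""
    scores = {}                                       # accumulator per docID
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        w_tq = math.log10(N / len(postings))          # query weight: idf (query tf assumed 1)
        for doc_id, tf in postings:
            w_td = 1 + math.log10(tf)                 # document weight: log tf, no idf
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]         # cosine normalization of the document
    # Query-side normalization is omitted: it rescales all scores equally
    # and so does not affect the ranking.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```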


Sec. 6.4
tf-idf weighting has many variants

SMART notation (IIR Figure 6.15); the column entries are acronyms for weight schemes:

Term frequency:
n (natural): tf_{t,d}
l (logarithm): 1 + log(tf_{t,d})
a (augmented): 0.5 + 0.5 · tf_{t,d} / max_t tf_{t,d}
b (boolean): 1 if tf_{t,d} > 0, else 0
L (log ave): (1 + log tf_{t,d}) / (1 + log ave_{t∈d} tf_{t,d})

Document frequency:
n (no): 1
t (idf): log(N / df_t)
p (prob idf): max{0, log((N − df_t) / df_t)}

Normalization:
n (none): 1
c (cosine): 1 / √(w_1² + w_2² + … + w_M²)
u (pivoted unique): 1/u
b (byte size): 1/CharLength^α, α < 1

Why is the base of the log in idf immaterial?
41


Sec. 6.4
Weighting may differ in queries vs documents
 Many search engines allow different weightings for queries vs. documents
 SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf, and cosine normalization
 (No idf for documents: a bad idea?)
 Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …
42


Sec. 6.4
tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

Query (ltc)
Term      | tf-raw | tf-wt | df     | idf | wt  | n'lize
auto      | 0      | 0     | 5000   | 2.3 | 0   | 0
best      | 1      | 1     | 50000  | 1.3 | 1.3 | 0.34
car       | 1      | 1     | 10000  | 2.0 | 2.0 | 0.52
insurance | 1      | 1     | 1000   | 3.0 | 3.0 | 0.78

Document (lnc)                              | Prod
Term      | tf-raw | tf-wt | wt  | n'lize   |
auto      | 1      | 1     | 1   | 0.52     | 0
best      | 0      | 0     | 0   | 0        | 0
car       | 1      | 1     | 1   | 0.52     | 0.27
insurance | 2      | 1.3   | 1.3 | 0.68     | 0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
43
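A sketch reproducing the lnc.ltc computation above. The idf values are taken straight from the slide rather than derived from df (since N is left as an exercise); everything else follows the table:

```python
import math

query = ["best", "car", "insurance"]
doc = ["car", "insurance", "auto", "insurance"]
idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}  # from the slide

def log_tf(count):
    return 1 + math.log10(count) if count > 0 else 0.0

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

terms = sorted(set(query) | set(doc))
q_wt = [log_tf(query.count(t)) * idf[t] for t in terms]   # ltc: log tf × idf
d_wt = [log_tf(doc.count(t)) for t in terms]              # lnc: log tf, no idf

score = sum(q * d for q, d in zip(normalize(q_wt), normalize(d_wt)))
print(f"Score ≈ {score:.2f}")   # ≈ 0.80, matching the slide
```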


Summary – vector space ranking
 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
44


Ch. 6
Resources for today's lecture
 IIR 6.2 – 6.4.3
 http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
 Term weighting and cosine similarity tutorial for SEO folk!
45